Abstract. Frequent itemset mining is an essential part of data analysis and data
mining. Recent works propose interesting SAT-based encodings for the problem
of discovering frequent itemsets. Our aim in this work is to define strategies for
adapting SAT solvers to such encodings in order to improve model enumeration.
In this context, we study in depth the effects of restarts, branching heuristics
and clause learning. We then conduct an experimental evaluation on SAT-based
itemset mining instances to show how SAT solvers can be adapted to obtain an
efficient SAT model enumerator.


1 Introduction 

Frequent itemset mining is a keystone in several data analysis and data mining tasks.
Since the first article of Agrawal et al. [1] on association rules and itemset mining, the huge
number of works, challenges, datasets and projects shows the current interest in this
problem (see [21] for a survey of works addressing this problem). In [18], De Raedt et al.
initiated a research trend on constraint programming and data mining. The authors
proposed a framework for itemset mining offering a declarative and flexible representation
with several generic and efficient CP solving techniques. Encouraged by the promising
results of this framework, several contributions addressed other data mining problems
using either constraint programming (CP) or propositional satisfiability (SAT) (e.g.
[18, 8, 9, 5, 12, 11]). In this work, we focus on the SAT-based encoding of itemset mining
problems [12]. In this new SAT application, the goal is to enumerate all the models
of the propositional formula.

Today, propositional satisfiability has gained a considerable audience with the advent
of a new generation of solvers able to solve large instances encoding real-world
problems. In addition to the traditional applications of SAT to hardware and software
formal verification, this impressive progress led to an increasing use of SAT technology
to solve new real-world applications such as planning, bioinformatics and cryptography.
In the majority of these applications, we are mainly interested in the decision problem
and some of its optimisation variants (e.g. Max-SAT). Compared to other issues in
SAT, the SAT model enumeration problem has received much less attention. Most of
the recently proposed model enumeration approaches are built on top of SAT solvers.
Usually, these implementations are based on the use of additional clauses, called blocking
clauses, to avoid producing repeated models [16, 4, 17, 14]. Improvements have been
proposed to these blocking-clause-based enumeration solvers (e.g. [14, 17]). In particular,
the authors in [17] proposed several optimizations obtained through learning and
simplification of blocking clauses. However, this kind of approach quickly becomes
impractical. Indeed, in addition to the clauses learned from conflicts, one might add an exponential
number of blocking clauses in the worst case. In [7], the authors elaborate an interesting
approach for enumerating answer sets of a logic program (ASP), centered around
First-UIP learning and backjumping.

In [10], we proposed an approach based on a combination of a DPLL-like procedure
with CDCL-based SAT solvers, mainly in order to avoid the space-complexity limitation
induced by the addition of blocking clauses. In this work, we focus on the
SAT encoding of the problem of frequent itemset mining that we introduced in [12]. In
such an encoding, there is a one-to-one mapping between the models of the propositional
formula and the set of interesting patterns of the transaction database. Additionally,
even for condensed representations such as closed or maximal itemsets, the size of the
output might be exponential in the worst case.

The work presented in this paper is mainly motivated by this interesting SAT application
to data mining and by the lack of efficient model enumerators.

Our aim is to study, through an extensive empirical evaluation, the effects on model
enumeration of the main components of CDCL-based SAT solvers, including restarts,
branching heuristics and clause learning.


2 Background 

Let us first introduce the propositional satisfiability problem (SAT) and some necessary
notations. We consider the conjunctive normal form (CNF) representation for
propositional formulas. A CNF formula $\Phi$ is a conjunction ($\wedge$) of clauses, where a clause is
a disjunction ($\vee$) of literals. A literal is a positive ($p$) or negated ($\neg p$) propositional
variable. The two literals $p$ and $\neg p$ are called complementary. A CNF formula can also
be seen as a set of clauses, and a clause as a set of literals. Let us mention that any
propositional formula can be translated to CNF using the linear Tseitin encoding [22].
We denote by $Var(\Phi)$ the set of propositional variables occurring in $\Phi$.

A Boolean interpretation $\mathcal{B}$ of a propositional formula $\Phi$ is a function which associates
a value $\mathcal{B}(p) \in \{0, 1\}$ (0 corresponds to false and 1 to true) to the propositional
variables $p \in Var(\Phi)$. It is extended to CNF formulas as usual. A model of a formula $\Phi$
is a Boolean interpretation $\mathcal{B}$ that satisfies the formula, i.e., $\mathcal{B}(\Phi) = 1$. We denote by $\mathcal{M}(\Phi)$
the set of models of $\Phi$. The SAT problem consists in deciding whether a given formula admits a
model or not.

Let us informally describe the most important components of modern SAT solvers.
They are based on a reincarnation of the historical Davis, Putnam, Logemann and Loveland
procedure, commonly called DPLL [6]. It performs a backtrack search, selecting
at each level of the search tree a decision variable which is set to a Boolean value.
This assignment is followed by an inference step that deduces and propagates some
forced unit literal assignments. This is recorded in the implication graph, a central data
structure, which encodes the decision literals together with their implications. This
branching process is repeated until finding a model or a conflict. In the first case, the
formula is answered satisfiable, and the model is reported, whereas in the second case,
a conflict clause (called learnt clause) is generated by resolution following a bottom-up
traversal of the implication graph [15, 24]. The learning or conflict analysis process
stops when a conflict clause containing only one literal from the current decision level
is generated. Such a conflict clause asserts that the unique literal with the current level
(called asserting literal) is implied at a previous level, called assertion level, identified
as the maximum level of the other literals of the clause. The solver backtracks to the
assertion level and assigns that asserting literal to true. When an empty conflict clause
is generated, the literal is implied at level 0, and the original formula can be reported
unsatisfiable. In addition to this basic scheme, modern SAT solvers use other components
such as activity-based heuristics and restart policies. An extensive overview can
be found in [3].

3 Frequent Itemset Mining 

Formally, we define the problem of mining frequent itemsets (FIM for short) in the following
way. Let $\Omega$ be a finite set of items. A transaction is defined as a couple $(tid, I)$
where $tid$ is the transaction identifier and $I$ is an itemset, i.e., $I \subseteq \Omega$. A transaction
database is a finite set of transactions where each identifier $tid$ refers to a unique itemset.
We say that a transaction $(tid, I)$ supports an itemset $J$ if $J \subseteq I$.

The cover of an itemset $I$ in a transaction database $\mathcal{D}$ is the set of transactions in $\mathcal{D}$
supporting $I$: $\mathcal{C}(I, \mathcal{D}) = \{(tid, J) \in \mathcal{D} \mid I \subseteq J\}$. The support of an itemset $I$ in $\mathcal{D}$ is
defined as the size of its cover: $\mathcal{S}(I, \mathcal{D}) = |\mathcal{C}(I, \mathcal{D})|$.


Tid   Itemset
1     A, B, C, D
2     A, B, E, F
3     A, B, C
4     A, C, D, F
5     G
6     D
7     D, G

Table 1. A transaction database $\mathcal{D}$


Let $\mathcal{D}$ be a transaction database over $\Omega$ and $n$ a minimum support threshold. The
frequent itemset mining problem consists in computing the following set:

$\mathcal{FIM}(\mathcal{D}, n) = \{I \subseteq \Omega \mid \mathcal{S}(I, \mathcal{D}) \geq n\}$.

Definition 1 (Closed Frequent Itemset). Let $\mathcal{D}$ be a transaction database (over $\Omega$)
and $I$ an itemset ($I \subseteq \Omega$) such that $\mathcal{S}(I, \mathcal{D}) \geq 1$. The itemset $I$ is closed if, for every
itemset $J \subseteq \Omega$ with $I \subset J$, we have $\mathcal{S}(J, \mathcal{D}) < \mathcal{S}(I, \mathcal{D})$.







One can see that all the elements of $\mathcal{FIM}(\mathcal{D}, n)$ can be obtained from the closed itemsets
by computing their subsets. Enumerating all closed itemsets allows us to reduce
the size of the output. We denote by $\mathcal{CFIM}(\mathcal{D}, n)$ the subset of all closed itemsets in
$\mathcal{FIM}(\mathcal{D}, n)$.

For instance, consider the transaction database described in Table 1. The closed
frequent itemsets with the minimal support threshold equal to 2 are:
$\mathcal{CFIM}(\mathcal{D}, 2) = \{A, D, G, AB, AC, AF, ABC, ACD\}$.
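
The definitions of cover, support and closedness can be checked mechanically on Table 1. The following is a minimal brute-force sketch in Python, not the procedure studied in this paper; the function names are illustrative assumptions, and the empty itemset is skipped so that the result lines up with the list given above.

```python
from itertools import combinations

# Transaction database of Table 1: tid -> itemset.
DB = {
    1: {"A", "B", "C", "D"},
    2: {"A", "B", "E", "F"},
    3: {"A", "B", "C"},
    4: {"A", "C", "D", "F"},
    5: {"G"},
    6: {"D"},
    7: {"D", "G"},
}
ITEMS = sorted(set().union(*DB.values()))


def cover(itemset, db):
    """Identifiers of the transactions that contain every item of `itemset`."""
    return {tid for tid, items in db.items() if itemset <= items}


def support(itemset, db):
    """Size of the cover."""
    return len(cover(itemset, db))


def is_closed(itemset, db):
    """Definition 1. Testing single-item extensions is enough, because the
    support can only decrease when further items are added."""
    s = support(itemset, db)
    return s >= 1 and all(
        support(itemset | {a}, db) < s for a in ITEMS if a not in itemset
    )


def closed_frequent_itemsets(db, n):
    """Brute-force list of the non-empty closed itemsets with support >= n."""
    result = []
    for k in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            itemset = set(combo)
            if support(itemset, db) >= n and is_closed(itemset, db):
                result.append("".join(combo))
    return result


print(closed_frequent_itemsets(DB, 2))
```

This enumeration visits all subsets of the item set, so it is only meant for checking small examples such as Table 1.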

4 A SAT Encoding of Frequent Itemset Mining 

In this section, we describe SAT encodings for itemset mining which are mainly based
on the encodings proposed in [12]. In order to do this, we fix, without loss of generality,
a transaction database $\mathcal{D} = \{(1, I_1), \ldots, (m, I_m)\}$ and a minimal support threshold $n$.

The SAT encoding of itemset mining that we consider is based on the use of propositional
variables representing the items and the transaction identifiers in $\mathcal{D}$. More precisely,
with each item $a$ (resp. transaction identifier $i$), we associate a propositional variable,
denoted $p_a$ (resp. $q_i$). These propositional variables are used to capture all possible
itemsets and their covers. Formally, given a model $\mathcal{B}$ of the considered encoding, the
candidate itemset is $\{a \in \Omega \mid \mathcal{B}(p_a) = 1\}$ and its cover is $\{i \in \mathbb{N} \mid \mathcal{B}(q_i) = 1\}$.

The first propositional formula that we describe allows us to obtain the cover of the
candidate itemset:

$\bigwedge_{i=1}^{m} \left( \neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus I_i} p_a \right)$ (1)

This formula expresses that $q_i$ is true if and only if the candidate itemset is supported
by the transaction $i$. In other words, the candidate itemset is not supported by the
transaction $i$ ($q_i$ is false) when there exists an item $a$ ($p_a$ is true) that does not belong to
the transaction ($a \in \Omega \setminus I_i$).

The following propositional formula allows us to consider the itemsets having a
support greater than or equal to the minimal support threshold:

$\sum_{i=1}^{m} q_i \geq n$ (2)

This formula corresponds to a 0/1 linear inequality, usually called a cardinality constraint.
The first linear encoding of general 0/1 linear inequalities to CNF was proposed
by J. P. Warners in [23]. Several authors have addressed the issue of finding an efficient
encoding of cardinality constraints (e.g. [20, 19, 2]) as a CNF formula. Efficiency refers to both
the compactness of the representation (size of the CNF formula) and the ability to
achieve the same level of constraint propagation (generalized arc consistency) on the
CNF formula.

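We use $\mathcal{E}_{\mathcal{FIM}}(\mathcal{D}, n)$ to denote the encoding corresponding to the conjunction of
the two formulae (1) and (2). Then, we have the following property: $\mathcal{B}$ is a model of
$\mathcal{E}_{\mathcal{FIM}}(\mathcal{D}, n)$ iff $I = \{a \in \Omega \mid \mathcal{B}(p_a) = 1\}$ is a frequent itemset where
$\mathcal{C}(I, \mathcal{D}) = \{i \in \mathbb{N} \mid \mathcal{B}(q_i) = 1\}$.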




We now describe the propositional formula that forces the candidate itemset
to be closed:

$\bigwedge_{a \in \Omega} \left( \left( \bigwedge_{i=1}^{m} (q_i \rightarrow a \in I_i) \right) \rightarrow p_a \right)$ (3)

This formula means that if we have $\mathcal{S}(I, \mathcal{D}) = \mathcal{S}(I \cup \{a\}, \mathcal{D})$ then $a \in I$ holds. This
condition is necessary and sufficient to force the candidate itemset to be closed. Let
us note that the expressions of the form $a \in I_i$ correspond to constants, i.e., $a \in I_i$
corresponds to $\top$ if the item $a$ is in $I_i$, and to $\bot$ otherwise.

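Note that the formula (3) can be simply reformulated as a conjunction of clauses as
follows:

$\bigwedge_{a \in \Omega} \left( \left( \bigvee_{1 \leq i \leq m,\ a \notin I_i} q_i \right) \vee p_a \right)$ (4)

This reformulation is obtained using the equivalence $A \rightarrow B \equiv \neg A \vee B$.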
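
As a worked instance of this reformulation, take the item $D$ in the database of Table 1. It belongs to transactions 1, 4, 6 and 7, so the implications for those transactions have a true right-hand side and vanish, while the implications for transactions 2, 3 and 5 reduce to negated literals:

```latex
\begin{align*}
\Big(\bigwedge_{i=1}^{7} (q_i \rightarrow D \in I_i)\Big) \rightarrow p_D
  &\equiv \big((q_2 \rightarrow \bot) \wedge (q_3 \rightarrow \bot) \wedge (q_5 \rightarrow \bot)\big) \rightarrow p_D \\
  &\equiv (\neg q_2 \wedge \neg q_3 \wedge \neg q_5) \rightarrow p_D \\
  &\equiv q_2 \vee q_3 \vee q_5 \vee p_D
\end{align*}
```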

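We use $\mathcal{E}_{\mathcal{CFIM}}(\mathcal{D}, n)$ to denote the encoding corresponding to the conjunction of
the formulae (1), (2) and (3). Then, we have the following property: $\mathcal{B}$ is a model of
$\mathcal{E}_{\mathcal{CFIM}}(\mathcal{D}, n)$ iff $I = \{a \in \Omega \mid \mathcal{B}(p_a) = 1\}$ is a closed frequent itemset where
$\mathcal{C}(I, \mathcal{D}) = \{i \in \mathbb{N} \mid \mathcal{B}(q_i) = 1\}$.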

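Example 1. Let us consider the transaction database of Table 1. The problem encoding
the enumeration of frequent closed itemsets with a threshold of 4 can be written as:

$\{\neg q_1 \leftrightarrow (p_E \vee p_F \vee p_G),$
$\neg q_2 \leftrightarrow (p_C \vee p_D \vee p_G),$
$\neg q_3 \leftrightarrow (p_D \vee p_E \vee p_F \vee p_G),$
$\neg q_4 \leftrightarrow (p_B \vee p_E \vee p_G),$
$\neg q_5 \leftrightarrow (p_A \vee p_B \vee p_C \vee p_D \vee p_E \vee p_F),$
$\neg q_6 \leftrightarrow (p_A \vee p_B \vee p_C \vee p_E \vee p_F \vee p_G),$
$\neg q_7 \leftrightarrow (p_A \vee p_B \vee p_C \vee p_E \vee p_F),$
$(q_5 \vee q_6 \vee q_7 \vee p_A),$
$(q_4 \vee q_5 \vee q_6 \vee q_7 \vee p_B),$
$(q_2 \vee q_5 \vee q_6 \vee q_7 \vee p_C),$
$(q_2 \vee q_3 \vee q_5 \vee p_D),$
$(q_1 \vee q_3 \vee q_4 \vee q_5 \vee q_6 \vee q_7 \vee p_E),$
$(q_1 \vee q_3 \vee q_5 \vee q_6 \vee q_7 \vee p_F),$
$(q_1 \vee q_2 \vee q_3 \vee q_4 \vee q_6 \vee p_G),$
$q_1 + q_2 + q_3 + q_4 + q_5 + q_6 + q_7 \geq 4\}$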
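
The construction of such an encoding can be sketched in a few lines of Python. This is an illustrative sketch only: the literal representation and the function names are assumptions, formula (1) is split into clauses by hand, and the cardinality constraint (2) is checked by counting instead of being translated to CNF. Models are found by brute force, which is feasible here because there are only 14 variables.

```python
from itertools import product

# Transaction database of Table 1: tid -> itemset.
DB = {
    1: {"A", "B", "C", "D"},
    2: {"A", "B", "E", "F"},
    3: {"A", "B", "C"},
    4: {"A", "C", "D", "F"},
    5: {"G"},
    6: {"D"},
    7: {"D", "G"},
}
ITEMS = sorted(set().union(*DB.values()))

# A literal is a (variable, polarity) pair; a variable is ("p", item) or ("q", tid).


def cover_clauses(db):
    """Clausal form of formula (1): not q_i <-> OR of p_a for a outside I_i."""
    clauses = []
    for tid, items in db.items():
        missing = [a for a in ITEMS if a not in items]
        # not q_i -> (p_a1 or ... or p_ak)
        clauses.append([(("q", tid), True)] + [(("p", a), True) for a in missing])
        # p_a -> not q_i, for every item a missing from transaction i
        clauses.extend([[(("p", a), False), (("q", tid), False)] for a in missing])
    return clauses


def closure_clauses(db):
    """Formula (4): for each item a, OR of q_i over transactions without a, or p_a."""
    return [
        [(("q", tid), True) for tid, items in db.items() if a not in items]
        + [(("p", a), True)]
        for a in ITEMS
    ]


def show(clause):
    """Readable form of a clause, e.g. 'q2 v q3 v q5 v pD'."""
    return " v ".join(
        ("" if pol else "-") + kind + str(name) for (kind, name), pol in clause
    )


def models(db, n):
    """Brute-force models of (1), (2) and (4), reported as itemsets."""
    variables = [("p", a) for a in ITEMS] + [("q", tid) for tid in db]
    clauses = cover_clauses(db) + closure_clauses(db)
    for values in product([False, True], repeat=len(variables)):
        b = dict(zip(variables, values))
        if sum(b[("q", tid)] for tid in db) < n:
            continue  # constraint (2) violated
        if all(any(b[v] == pol for v, pol in c) for c in clauses):
            yield "".join(a for a in ITEMS if b[("p", a)])


for clause in closure_clauses(DB):
    print(show(clause))
print(sorted(models(DB, 4)))
```

With threshold 4 the sketch reports three models: the itemsets A and D, and the empty itemset, which is contained in all seven transactions and therefore also satisfies the encoding.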

5 Enumerating all Models of CNF Formulae 

A naive way to extend modern SAT solvers for the problem of enumerating all models
of a CNF formula consists in adding a blocking clause each time a model is found, to
prevent the search from returning the same model again. This approach is used in the
majority of the model enumeration methods in the literature. Its main limitation concerns the space
complexity, since the number of blocking clauses may be exponential in the worst case.
Indeed, in addition to the clauses learned at each conflict by the CDCL-based SAT
solver, the number of added blocking clauses becomes very large on problems with a
huge number of models. This explains why it is necessary to design methods avoiding
the need to keep all blocking clauses. It is particularly the case for encodings of data
mining tasks where the number of interesting patterns is often significant, even when
using condensed representations such as closed patterns.
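
The growth of the clause database under this scheme can be seen on a toy example. The sketch below is an assumption-laden illustration, not the CDCL-Enum procedure evaluated later: `solve` is a deliberately naive backtracking search standing in for a real SAT solver, literals are signed integers, and the search is restarted from scratch after each model.

```python
def solve(clauses, n_vars, assignment=None):
    """Tiny backtracking SAT search; returns a total model (dict) or None."""
    assignment = dict(assignment or {})
    for clause in clauses:
        # A clause is falsified when every literal is assigned and false.
        if all(abs(l) in assignment and assignment[abs(l)] != (l > 0) for l in clause):
            return None
    free = [v for v in range(1, n_vars + 1) if v not in assignment]
    if not free:
        return assignment
    for value in (True, False):
        model = solve(clauses, n_vars, {**assignment, free[0]: value})
        if model is not None:
            return model
    return None


def enumerate_with_blocking(clauses, n_vars):
    """Find a model, add its blocking clause, search again, until UNSAT."""
    clauses = [list(c) for c in clauses]
    models = []
    while True:
        model = solve(clauses, n_vars)
        if model is None:
            return models, clauses
        models.append(model)
        # Blocking clause: at least one variable must differ from this model.
        clauses.append([-v if model[v] else v for v in range(1, n_vars + 1)])


# (x1 or x2) and (not x1 or x3)
FORMULA = [[1, 2], [-1, 3]]
found, grown = enumerate_with_blocking(FORMULA, 3)
print(len(found), "models,", len(grown) - len(FORMULA), "blocking clauses added")
```

One blocking clause is stored per model, which is exactly the space cost discussed above.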

Our main aim is to experimentally study the effects of each component of modern
SAT solvers on the efficiency of model enumeration, in the case of the encoding
described in Section 4. As the number of frequent closed itemsets is usually huge, this
results in a huge number of models for the considered encoding. Consequently, it is
not suitable to store the found models using blocking clauses during the enumeration
process.

We proceed by incrementally removing some components of modern SAT solvers
in order to evaluate their effects on the efficiency of model enumeration. The first removed
component is the restart policy. Indeed, we inhibit restarts in order to allow
solvers to avoid the use of blocking clauses. Thus, our procedure performs a simple
backtrack each time a model is found. The second removed component is clause
learning, which leads to a DPLL-like procedure. Considering a DPLL-like procedure,
we pursue our analysis by considering the branching heuristics. Indeed, our goal is to
find the heuristics suitable for the considered SAT encoding. To this end, we consider
three branching heuristics. We first study the performance of the well-known VSIDS
(Variable State Independent Decaying Sum) branching heuristic. In this case, at each
conflict an analysis is only performed to weight variables (no learnt clause is added).
The second considered branching heuristic is based on the maximum number of occurrences
of the variables. The third one consists in selecting the variables randomly.
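
A DPLL-like enumeration of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the DPLL-Enum implementation used in the experiments: literals are signed integers, the branching heuristic is a pluggable function, only an occurrence-count rule and a random rule are shown, and VSIDS is left out because it needs conflict analysis.

```python
import random


def unit_propagate(clauses, assignment):
    """Extend `assignment` with forced literals; return None on conflict."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue  # clause already satisfied
            free = [l for l in clause if abs(l) not in assignment]
            if not free:
                return None  # every literal is false: conflict
            if len(free) == 1:
                assignment[abs(free[0])] = free[0] > 0
                changed = True
    return assignment


def most_occurrences(clauses, assignment, free_vars):
    """Pick the free variable occurring most often in unsatisfied clauses."""
    counts = dict.fromkeys(free_vars, 0)
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue
        for l in clause:
            if abs(l) in counts:
                counts[abs(l)] += 1
    return max(free_vars, key=lambda v: counts[v])


def random_choice(clauses, assignment, free_vars):
    """Pick a free variable uniformly at random."""
    return random.choice(free_vars)


def dpll_enum(clauses, n_vars, pick=most_occurrences, assignment=None, out=None):
    """Enumerate all models by plain backtracking: no restarts and no
    learnt or blocking clauses are ever added."""
    out = [] if out is None else out
    assignment = unit_propagate(clauses, assignment or {})
    if assignment is None:
        return out  # conflict: backtrack
    free_vars = [v for v in range(1, n_vars + 1) if v not in assignment]
    if not free_vars:
        out.append(assignment)  # model found: record it, then simply backtrack
        return out
    v = pick(clauses, assignment, free_vars)
    for value in (True, False):
        dpll_enum(clauses, n_vars, pick, {**assignment, v: value}, out)
    return out


# (x1 or x2) and (not x1 or x3)
CNF = [[1, 2], [-1, 3]]
print(len(dpll_enum(CNF, 3)))
```

Because the two branches on a decision variable are disjoint, every model is reached exactly once, so nothing has to be stored to prevent repetitions.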


6 Experiments 

We carried out an experimental evaluation to analyze the effects of adding blocking
clauses, adding learned clauses and branching heuristics. To this end, we implemented
a DPLL-like procedure, denoted DPLL-Enum, without adding blocking and learned
clauses. We also implemented a procedure on top of the state-of-the-art CDCL SAT
solver MiniSAT 2.2, denoted CDCL-Enum. In this procedure, each time a model is
found, we add a no-good and perform a restart. We considered a variety of datasets
taken from the FIMI (http://fimi.ua.ac.be/data/) and CP4IM
(http://dtai.cs.kuleuven.be/CP4IM/datasets/) repositories. All the experiments were done on Intel
Xeon quad-core machines with 32GB of RAM running at 2.66 GHz. For each instance,
we used a timeout of 15 minutes of CPU time.

In our experiments, we compare the performance of CDCL-Enum to three variants
of DPLL-Enum, with different branching heuristics, in enumerating all the models
corresponding to the closed frequent itemsets. The considered variants of DPLL-Enum are
the following:

- DPLL-Enum+VSIDS: DPLL-Enum with the VSIDS branching heuristic;
- DPLL-Enum+JW: DPLL-Enum with a branching heuristic based on the maximum
  number of occurrences of the variables [13];
- DPLL-Enum+RAND: DPLL-Enum with a random variable selection.
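
For readers unfamiliar with the Jeroslow-Wang rule [13], the sketch below shows its usual textbook form, in which each occurrence of a literal is weighted by the length of the clause it appears in. Which exact variant is used in DPLL-Enum+JW is not stated here, so the two-sided variable selection and the function names are assumptions. The scores are computed on the clause list as given; a solver would pass only the clauses not yet satisfied.

```python
from collections import defaultdict


def jeroslow_wang_scores(clauses):
    """J(l) = sum, over the clauses containing literal l, of 2 ** -len(clause)."""
    scores = defaultdict(float)
    for clause in clauses:
        for lit in clause:
            scores[lit] += 2.0 ** -len(clause)
    return scores


def pick_variable(clauses, free_vars):
    """Two-sided rule: choose the variable maximising J(x) + J(-x) and
    branch first on the polarity with the larger score."""
    scores = jeroslow_wang_scores(clauses)
    var = max(free_vars, key=lambda v: scores[v] + scores[-v])
    return var, scores[var] >= scores[-var]


# (x1 or x2) and (not x1 or x3) and (x1 or not x3 or x4)
CLAUSES = [[1, 2], [-1, 3], [1, -3, 4]]
print(pick_variable(CLAUSES, [1, 2, 3, 4]))
```

Short clauses contribute more to the score, so variables that appear in nearly unit clauses are preferred over variables that are merely frequent.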

Our comparison is depicted by the cactus plots of Figure 1. Each dot $(x, y)$
represents an instance with a fixed minimal support threshold $n$. Each cactus plot corresponds
to one dataset and shows the evolution of the CPU time needed to enumerate all models with
the different algorithms while varying the quorum. For each dataset, we tested different
values of $n$. The x-axis represents the quorum and the y-axis the CPU time (in seconds)
needed for the enumeration of all closed frequent itemsets.

Unsurprisingly, the DPLL-like procedures outperform CDCL-Enum on the majority
of instances. This shows that a DPLL-based approach is more suitable for SAT-based
itemset mining. Part of the explanation lies in the significant number of models.
Furthermore, DPLL-Enum+RAND is clearly less efficient than DPLL-Enum+VSIDS
and DPLL-Enum+JW, which shows that the branching heuristic plays a key role in
model enumeration algorithms. Moreover, our experiments show that DPLL-Enum+JW
is better than DPLL-Enum+VSIDS overall, even if DPLL-Enum+VSIDS competes with
DPLL-Enum+JW on datasets such as anneal and mushroom. Indeed,
DPLL-Enum+JW clearly outperforms DPLL-Enum+VSIDS on several datasets, such
as chess, kr-vs-kp and splice-1. Note that for two of these datasets, the solvers
DPLL-Enum+RAND and CDCL-Enum are not able to completely enumerate the set
of all models for all the considered quorums.

As a summary, our experimental evaluation suggests that a DPLL-like procedure 
is a more suitable approach when the number of models of a propositional formula is 
significant. It also suggests that the branching heuristic is a key point in such a procedure 
to improve the performance. 

7 Conclusion 

In this paper, we investigated the impact of modern SAT solvers on the problem of
enumerating all the models of CNF formulas encoding the frequent closed itemset mining
problem. Our goal was to measure the impact of the classical components of CDCL-based
SAT solvers on the efficiency of model enumeration. Our results suggest that on formulas
with a huge number of models, SAT solvers must be adapted to efficiently enumerate
all the models. We showed that a simple DPLL solver augmented with the classical
Jeroslow-Wang heuristic achieves better performance.

As future work, we plan to pursue our investigation in order to find the best heuristics
for enumerating models encoding data mining problems. Finding how to efficiently
integrate clause learning for model enumeration is another interesting issue.


References 

1. Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. Mining association rules between
sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International
Conference on Management of Data, SIGMOD '93, pages 207-216, New York, NY, USA,
1993. ACM.


[Figure 1: nine panels (german-credit, australian-credit, hepatitis, anneal, mushroom, heart-cleveland, splice-1, chess, kr-vs-kp), each plotting CPU time in seconds against the quorum for CDCL-Enum, DPLL-Enum+RAND, DPLL-Enum+VSIDS and DPLL-Enum+JW.]

Fig. 1. Frequent Closed Itemsets: CDCL vs DPLL-Like enumeration

2. Roberto Asin, Robert Nieuwenhuis, Albert Oliveras, and Enric Rodriguez-Carbonell. Cardinality
networks: a theoretical and empirical study. Constraints, 16(2):195-221, 2011.

3. Armin Biere, Marijn J. H. Heule, Hans van Maaren, and Toby Walsh, editors. Handbook of
Satisfiability, volume 185 of Frontiers in AI and Applications. IOS Press, 2009.

4. Pankaj Chauhan, Edmund M. Clarke, and Daniel Kroening. Using SAT based image computation
for reachability analysis. Technical Report CMU-CS-03-151, 2003.

5. Emmanuel Coquery, Said Jabbour, Lakhdar Sais, and Yakoub Salhi. A SAT-based approach
for discovering frequent, closed and maximal patterns in a sequence. In Proceedings of the
20th European Conference on Artificial Intelligence (ECAI'12), pages 258-263, 2012.

6. M. Davis, G. Logemann, and D. W. Loveland. A machine program for theorem-proving.
Comm. of the ACM, 5(7):394-397, 1962.

7. Martin Gebser, Benjamin Kaufmann, Andre Neumann, and Torsten Schaub. Conflict-driven 
answer set enumeration. In Chitta Baral, Gerhard Brewka, and John Schlipf, editors, Logic 
Programming and Nonmonotonic Reasoning, volume 4483 of Lecture Notes in Computer 
Science, pages 136-148. Springer Berlin Heidelberg, 2007. 

8. T. Guns, S. Nijssen, and L. De Raedt. Itemset mining: A constraint programming perspective.
Artif. Intell., 175(12-13):1951-1983, 2011.

9. Rui Henriques, Ines Lynce, and Vasco M. Manquinho. On when and how to use SAT to mine
frequent itemsets. CoRR, abs/1207.6253, 2012.

10. Said Jabbour, Jerry Lonlac, Lakhdar Sais, and Yakoub Salhi. Extending modern SAT solvers
for models enumeration. In Proceedings of the 15th IEEE International Conference on Information
Reuse and Integration, IRI 2014, Redwood City, CA, USA, August 13-15, 2014,
pages 803-810, 2014.

11. Said Jabbour, Lakhdar Sais, and Yakoub Salhi. Boolean satisfiability for sequence mining.
In 22nd ACM International Conference on Information and Knowledge Management
(CIKM'13), pages 649-658. ACM, 2013.

12. Said Jabbour, Lakhdar Sais, and Yakoub Salhi. The top-k frequent closed itemset mining
using top-k SAT problem. In European Conference on Machine Learning and Knowledge
Discovery in Databases (ECML/PKDD'13), pages 403-418, 2013.

13. R. G. Jeroslow and J. Wang. Solving propositional satisfiability problems. Annals of Mathematics
and Artificial Intelligence, 1:167-187, 1990.

14. Hoonsang Jin, Hyojung Han, and Fabio Somenzi. Efficient conflict analysis for finding all
satisfying assignments of a Boolean circuit. In TACAS'05, LNCS 3440, pages 287-300.
Springer, 2005.

15. J. P. Marques-Silva and K. A. Sakallah. GRASP - A New Search Algorithm for Satisfiability.
In Proceedings of IEEE/ACM CAD, pages 220-227, 1996.

16. Kenneth L. McMillan. Applying SAT methods in unbounded symbolic model checking. In
Proceedings of the 14th International Conference on Computer Aided Verification (CAV'02),
pages 250-264, 2002.

17. Antonio R. Morgado and Joao P. Marques-Silva. Good Learning and Implicit Model Enumeration.
In International Conference on Tools with Artificial Intelligence (ICTAI 2005),
pages 131-136. IEEE, 2005.

18. L. De Raedt, T. Guns, and S. Nijssen. Constraint programming for itemset mining. In ACM
SIGKDD, pages 204-212, 2008.

19. J. P. Marques Silva and I. Lynce. Towards robust CNF encodings of cardinality constraints. In
CP, pages 483-497, 2007.

20. C. Sinz. Towards an optimal CNF encoding of Boolean cardinality constraints. In CP'05,
pages 827-831, 2005.

21. A. Tiwari, R. K. Gupta, and D. P. Agrawal. A survey on frequent pattern mining: Current
status and challenging issues. Inform. Technol. J., 9:1278-1293, 2010.

22. G. S. Tseitin. On the complexity of derivations in the propositional calculus. In H.A.O.
Slesenko, editor, Structures in Constructive Mathematics and Mathematical Logic, Part II,
pages 115-125, 1968.

23. J. P. Warners. A linear-time transformation of linear inequalities into conjunctive normal
form. Information Processing Letters, 1996.

24. L. Zhang, C. F. Madigan, M. W. Moskewicz, and S. Malik. Efficient conflict driven learning
in Boolean satisfiability solver. In IEEE/ACM CAD'2001, pages 279-285, 2001.



